Analysis part of the R-exercise 4

1 Loading the inbuilt dataset Boston

library(MASS)
# load the data
data("Boston")

2 A brief overview of the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston data frame has 506 observations of 14 variables:

  • crim - per capita crime rate by town.
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town.
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox - nitrogen oxides concentration (parts per 10 million).
  • rm - average number of rooms per dwelling.
  • age - proportion of owner-occupied units built prior to 1940.
  • dis - weighted mean of distances to five Boston employment centres.
  • rad - index of accessibility to radial highways.
  • tax - full-value property-tax rate per $10,000.
  • ptratio - pupil-teacher ratio by town.
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat - lower status of the population (percent).
  • medv - median value of owner-occupied homes in $1000s.

The information about the dataset can be found here.

3 Plotting the data

# plot matrix of the variables
pairs(Boston)

The pairs plot is too cluttered to read as is, so let’s see whether we can narrow the data down to a few variables of interest and then draw it again…
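As a quick sanity check, a pairs plot restricted to a few columns is already readable; the subset below is only an illustrative guess at possible variables of interest, not a final selection:

```r
library(MASS)
data("Boston")
# pairs plot on a hand-picked subset of columns (illustrative choice)
pairs(Boston[, c("crim", "rad", "tax", "nox", "dis", "medv")])
```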

Data correlations

Let’s draw the correlations between the variables.

library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)

We can see the most strongly correlated variable pairs by filtering the correlation matrix:

library(reshape2)
CM <- cor_matrix                           # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA       # lower tri and diag set to NA
subset(melt(CM, na.rm = TRUE), abs(value) > .7)
##      Var1 Var2 value
## 59  indus  nox  0.76
## 89    nox  age  0.73
## 101 indus  dis -0.71
## 103   nox  dis -0.77
## 105   age  dis -0.75
## 129 indus  tax  0.72
## 135   rad  tax  0.91
## 195 lstat medv -0.74

Variables rad and tax are highly positively correlated (0.91). This can be interpreted as: properties with better access to the city’s radial highways also tend to have a higher property-tax rate.

Next are nox and dis with a correlation of -0.77. The interpretation is that the smaller the weighted distance to the five Boston employment centres, the higher the nitrogen oxides concentration (worse air pollution).

Third, with a positive correlation of 0.76, come indus and nox: the higher the proportion of non-retail business acres in a town, the higher the concentration of nitrogen oxides.

(One thing worth noting is that the status of the inhabitants (lstat) and the number of rooms (rm) correlate strongly with the housing value (medv).)

For convenience, let’s plot histograms only for these variables of interest.

Distributions of the chosen variables

library(ggpubr)
library(ggplot2)

rad_plot <- ggplot(Boston, aes(x = rad)) +
  geom_histogram()

tax_plot <- ggplot(Boston, aes(x = tax)) +
  geom_histogram()

nox_plot <- ggplot(Boston, aes(x = nox)) +
  geom_histogram()

dis_plot <- ggplot(Boston, aes(x = dis)) +
  geom_histogram()

indus_plot <- ggplot(Boston, aes(x = indus)) +
  geom_histogram()

ggarrange(rad_plot, tax_plot, nox_plot, dis_plot, indus_plot, 
          labels = c("A", "B", "C", "D", "E"),
          ncol = 3, nrow = 2)

Plot A rad is the index of accessibility to radial highways. There are two concentrations in the graph, the first at index 4-5 and the second at index 24. The latter (index value 24) contains 132 observations, about 26% of all 506 observations. It would make sense to check a map of Boston to better understand these outliers.

Plot B tax shows the full-value property-tax rate, and the distribution is quite similar to plot A, i.e. a concentration at the low end and a large group of outliers at the top (132 observations at value 666).

Plots C and D (nox and dis) also have right-skewed distributions.

Plot E indus is also right-skewed, with a high count (132 observations) at a single value (18.1).

There seem to be 132 observations in the data that stand out in the same way across several variables.
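We can check whether the three 132-observation spikes actually come from the same towns; a quick sketch (the specific values 24, 666 and 18.1 are read off the histograms above):

```r
library(MASS)
data("Boston")
# counts of the individual spikes in rad, tax and indus
sum(Boston$rad == 24)
sum(Boston$tax == 666)
# if the three spikes coincide, this intersection should also have the same count
sum(Boston$rad == 24 & Boston$tax == 666 & Boston$indus == 18.1)
```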

4 Standardising the dataset

boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The scaled dataset is a matrix, so we’ll convert it into a data frame. In the new data frame all variables have zero mean and unit variance.

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
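A quick sanity check that the scaling worked as expected (each column should have mean ≈ 0 and standard deviation ≈ 1, up to floating-point noise):

```r
# column means should be (numerically) zero and sds one after scale()
round(colMeans(boston_scaled), 10)
round(apply(boston_scaled, 2, sd), 10)
```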

Linear discriminant analysis produces results based on the assumptions that

  • the variables are normally distributed (conditional on the classes),
  • the class-specific normal distributions share the same covariance matrix, and
  • the variables are continuous.

Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate), using the quantiles as the break points. Drop the old crime rate variable from the dataset. Divide the dataset into train and test sets so that 80% of the data belongs to the train set.

bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
#table(crime)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

5 Fitting and plotting the linear discriminant analysis on the train set

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
#lda.fit

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the graph we can see the variables that imply high crime, the most important factor being rad (the arrows pointing at the high crime cluster).

6 Predict the classes with the LDA model on the test data

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       23       9        0    0
##   med_low    6      11        8    0
##   med_high   0       7       10    2
##   high       0       0        0   26

The model predicts high crime very well (26 out of 26 correct), and looking at the predictions as a whole, it performs better on the higher crime-rate classes. In every class, the correct label was also the most frequently predicted one.
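These observations can be quantified from the cross tabulation; a small sketch reusing correct_classes and lda.pred from the chunk above:

```r
# overall and per-class accuracy from the confusion matrix
conf <- table(correct = correct_classes, predicted = lda.pred$class)
overall_acc <- sum(diag(conf)) / sum(conf)
class_acc <- diag(conf) / rowSums(conf)  # share of each true class predicted correctly
overall_acc
class_acc
```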

7 Standardising the original dataset and calculating distances

First we calculate the Euclidean distance matrix, and then the Manhattan distance matrix.

library(factoextra)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- get_dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Let’s visualise the distances between the observations.

fviz_dist(dist_eu,  show_labels = TRUE, lab_size = 0.6, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

Calculating the K-means

############
# K-means  #
############
k_max <- 10
set.seed(123)

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results (qplot is deprecated, so build the plot with ggplot directly)
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()

From the graph we choose two centres, as that is the point where the slope changes from steep to flat (the “elbow”). Let’s run k-means with two centres.
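As a side note, factoextra ships a ready-made elbow plot that reproduces the manual loop above and could be used instead:

```r
# alternative elbow plot via factoextra
# (plots the total within-cluster sum of squares for k = 1..10)
fviz_nbclust(boston_scaled, kmeans, method = "wss", k.max = 10)
```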

# k-means clustering
set.seed(123)
km <-kmeans(boston_scaled, centers = 2)

    
library(tidyverse)
library(data.table)
boston_scaled %>%
  mutate(Cluster = km$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>%
DT::datatable(extensions = 'FixedColumns', options = list(dom = 't',
  scrollX = TRUE,
  scrollCollapse = TRUE))

You can scroll the datatable to see the rest of the variables, which are summarised by their means grouped by cluster number. There are clear differences between the two clusters: e.g. cluster #1 has higher crime, a bigger proportion of non-retail business acres, more pollution, older buildings, longer distances to the employment centres, fewer teachers per pupil, and a lower median value of owner-occupied homes.

Let’s draw the pairs plot coloured according to the two clusters:

# k-means clustering
set.seed(123)
km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Red is cluster #1 and black is the nicer-neighbourhood cluster #2.

We can see in the graph what we earlier saw numerically in the datatable. The top row plots crim against the other variables, and cluster #1 shows higher crime rates throughout.

# k-means clustering
set.seed(123)
km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[1:5], col = km$cluster)

If we compare crim and indus, we see that the red cluster has higher crime and is more industrialised than the black one. The same holds for nox, and it is logical that more industry comes with more air pollution. zn shows that the red cluster has hardly any residential land zoned for large lots, unlike the black cluster.

We could go on analysing every variable in the graph, but that is beyond the point of this exercise.

Bonus

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
k3 <- kmeans(boston_scaled, centers = 3)
k4 <- kmeans(boston_scaled, centers = 4)
k5 <- kmeans(boston_scaled, centers = 5)
k6 <- kmeans(boston_scaled, centers = 6)
k7 <- kmeans(boston_scaled, centers = 7)

# plots to compare
p2 <- fviz_cluster(k3, geom = "point",  data = boston_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = boston_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = boston_scaled) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point",  data = boston_scaled) + ggtitle("k = 6")
p6 <- fviz_cluster(k7, geom = "point",  data = boston_scaled) + ggtitle("k = 7")

library(gridExtra)
grid.arrange(p2, p3, p4, p5, p6, nrow = 3)

I choose k = 3, as it gives the clearest separation in my opinion.

set.seed(123)
km <- boston_scaled %>%
  kmeans(centers = 3)
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)

classes<-as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col=classes)
lda.arrows(lda.fit, myscale = 4)

rad and tax (and perhaps indus) are the most influential linear separators for cluster #2, age and chas for cluster #1, and dis and rm for cluster #3. black points towards clusters #1 and #3.